Abstract

This paper introduces GLARF, a framework for predicate argument structure. We report on converting the Penn Treebank II into GLARF by automatic methods that achieved about 90% precision/recall on test sentences from the Penn Treebank. Plans for a corpus of hand-corrected output, extensions of GLARF to Japanese and applications for MT are also discussed.

1  Introduction 

Applications using annotated corpora are often, by design, limited by the information found in those corpora. Since most English treebanks provide limited predicate-argument (PRED-ARG) information, parsers based on these treebanks do not produce more detailed predicate argument structures (PRED-ARG structures). The Penn Treebank II (Marcus et al., 1994) marks subjects (SBJ), logical subjects of passives (LGS), some reduced relative clauses (RRC), as well as other grammatical information, but does not mark each constituent with a grammatical role. In our view, a full PRED-ARG description of a sentence would do just that: assign each constituent a grammatical role that relates that constituent to one or more other constituents in the sentence. For example, the role HEAD relates a constituent to its parent and the role OBJ relates a constituent to the HEAD of its parent. We believe that the absence of this detail limits the range of applications for treebank-based parsers. In particular, it limits the extent to which it is possible to generalize, e.g., marking IND-OBJ and OBJ roles allows one to generalize a single pattern to cover two related examples (“John gave Mary a book” = “John gave a book to Mary”). Distinguishing complement PPs (COMP) from adjunct PPs (ADV) is useful because the former are likely to have an idiosyncratic interpretation, e.g., the object of “at” in “John is angry at Mary” is not a locative and should be distinguished from the locative case by many applications.

In an attempt to fill this gap, we have begun a project to add this information using both automatic procedures and hand-annotation. We are implementing automatic procedures for mapping the Penn Treebank II (PTB) into a PRED-ARG representation and then we are correcting the output of these procedures manually. In particular, we are hoping to encode information that will enable a greater level of regularization across linguistic structures than is possible with PTB.

This paper introduces GLARF, the Grammatical and Logical Argument Representation Framework. We designed GLARF with four objectives in mind: (1) capturing regularizations: noncanonical constructions (e.g., passives, filler-gap constructions, etc.) are represented in terms of their canonical counterparts (simple declarative clauses); (2) representing all phenomena using one simple data structure: the typed feature structure; (3) consistently labeling all arguments and adjuncts for phrases with clear heads; and (4) producing clear and consistent PRED-ARGs for phrases that do not have heads, e.g., conjoined structures, named entities, etc. Rather than trying to squeeze these phrases into an X-bar mold, we customized our representations to reflect their head-less properties. We believe that a framework for PRED-ARG needs to satisfy these objectives to adequately cover a corpus like PTB.

We believe that GLARF, because of its uniform treatment of PRED-ARG relations, will be valuable for many applications, including question answering, information extraction, and machine translation. In particular, for MT, we expect it will benefit procedures which learn translation rules from syntactically analyzed parallel corpora, such as (Matsumoto et al., 1993; Meyers et al., 1996). Much closer alignments will be possible using GLARF, because of its multiple levels of representation, than would be possible with surface structure alone (an example is provided at the end of Section 3). For this reason, we are currently investigating the extension of our mapping procedure to treebanks of Japanese (the Kyoto Corpus) and Spanish (the UAM Treebank (Moreno et al., 2000)). Ultimately, we intend to create a parallel trilingual treebank using a combination of automatic methods and human correction. Such a treebank would be a valuable resource for corpus-trained MT systems.

The primary goal of this paper is to discuss the considerations for adding PRED-ARG information to PTB, and to report on the performance of our mapping procedure. We intend to wait until these procedures are mature before beginning annotation on a larger scale. We also describe our initial research on covering the Kyoto Corpus of Japanese with GLARF.

2  Previous  Treebanks 

There are several corpora annotated with PRED-ARG information, but each encodes somewhat different distinctions. The Susanne Corpus (Sampson, 1995) consists of about 1/6 of the Brown Corpus annotated with detailed syntactic information. Unlike GLARF, the Susanne framework does not guarantee that each constituent be assigned a grammatical role. Some grammatical roles (e.g., subject, object) are marked explicitly, others are implied by phrasetags (Fr corresponds to the GLARF node label SBAR under a RELATIVE arc label) and other constituents are not assigned roles (e.g., constituents of NPs). Apart from this concern, it is reasonable to ask why we did not adapt this scheme for our use. Susanne’s granularity surpasses PTB-based GLARF in many areas, with about 350 wordtags (part of speech) and 100 phrasetags (phrase node labels). However, GLARF would express many of the details in other ways, using fewer node and part of speech (POS) labels and more attributes and role labels. In the feature structure tradition, GLARF can represent varying levels of detail by adding or subtracting attributes or defining subsumption hierarchies. Thus both Susanne’s NP1p wordtag and Penn’s NNP wordtag would correspond to GLARF’s NNP POS tag. A GLARF-style Susanne analysis of “Ontario, Canada” is (NP (PROVINCE (NNP Ontario)) (PUNCTUATION (, ,)) (COUNTRY (NNP Canada)) (PATTERN NAME) (SEM-FEATURE LOC)). A GLARF-style PTB analysis uses the roles NAME1 and NAME2 instead of PROVINCE and COUNTRY, where name roles (NAME1, NAME2) are more general than PROVINCE and COUNTRY in a subsumption hierarchy. In contrast, attempts to convert PTB into Susanne would fail because detail would be unavailable. Similarly, attempts to convert Susanne into the PTB framework would lose information. In summary, GLARF’s ability to represent varying levels of detail allows different types of treebank formats to be converted into GLARF, even if they cannot be converted into each other. Perhaps GLARF can become a lingua franca among annotated treebanks.
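
The role subsumption described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PARENT` table and the helper names `subsumes` and `generalize` are invented for this example, and only the pairing of PROVINCE with NAME1 and COUNTRY with NAME2 is taken from the “Ontario, Canada” analysis.

```python
# Hypothetical fragment of a role hierarchy: each specific role points to
# the more general role that subsumes it (assumed for this example only).
PARENT = {"PROVINCE": "NAME1", "COUNTRY": "NAME2"}


def subsumes(general, specific):
    """Return True if `general` equals `specific` or lies above it in PARENT."""
    role = specific
    while role is not None:
        if role == general:
            return True
        role = PARENT.get(role)
    return False


def generalize(constituents):
    """Replace each role by its listed parent, discarding the finer detail."""
    return [(PARENT.get(role, role), word) for role, word in constituents]


# Susanne-style roles for "Ontario, Canada", as (role, word) pairs.
susanne_style = [("PROVINCE", "Ontario"), ("PUNCTUATION", ","), ("COUNTRY", "Canada")]
ptb_style = generalize(susanne_style)
```

Going from the detailed roles to the general ones is a simple lookup, while the reverse direction has no such table to consult, which mirrors the observation that PTB cannot be converted into Susanne without extra information.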

The Negra Corpus (Brants et al., 1997) provides PRED-ARG information for German, similar in granularity to GLARF. The most significant difference is that GLARF regularizes some phenomena which a Negra version of English would probably not, e.g., control phenomena. Another novel feature of GLARF is the ability to represent paraphrases (in the Harrisian sense) that are not entirely syntactic, e.g., nominalizations as sentences. Other schemes seem to only regularize strictly syntactic phenomena.

3  The  Structure  of  GLARF 

In GLARF, each sentence is represented by a typed feature structure. As is standard, we model feature structures as single-rooted directed acyclic graphs (DAGs). Each nonterminal is labeled with a phrase category, and each leaf is labeled with either: (a) a (PTB) POS label and a word (eat, fish, etc.) or (b) an attribute value (e.g., singular, passive, etc.). Types are based on nonterminal node labels, POSs and other attributes (Carpenter, 1992). Each arc bears a feature label which represents either a grammatical role (SBJ, OBJ, etc.) or some attribute of a word or phrase (morphological features, tense, semantic features, etc.).^1 For example, the subject of a sentence is the head of a SBJ arc, an attribute like SINGULAR is the head of a GRAM-NUMBER arc, etc. A constituent involved in multiple surface or logical relations may be at the head of multiple arcs. For example, the surface subject (S-SBJ) of a passive verb is also the logical object (L-OBJ). These two roles are represented as two arcs which share the same head. This sort of structure sharing analysis originates with Relational Grammar and related frameworks (Perlmutter, 1984; Johnson and Postal, 1980) and is common in Feature Structure frameworks (LFG, HPSG, etc.). Following (Johnson et al., 1993)^2, arcs are typed. There are five different types of role labels:

• Attribute roles: Gram-Number (grammatical number), Mood, Tense, Sem-Feature (semantic features like temporal/locative), etc.

• Surface-only relations (prefixed with S-), e.g., the surface subject (S-SBJ) of a passive.

• Logical-only roles (prefixed with L-), e.g., the logical object (L-OBJ) of a passive.

• Intermediate roles (prefixed with I-) representing neither surface nor logical positions. In “John seemed to be kidnapped by aliens”, “John” is the surface subject of “seem”, the logical object of “kidnapped”, and the intermediate subject of “to be”. Intermediate arcs are helpful for modeling the way sentences conform to constraints. The intermediate subject arc obeys lexical constraints and connects the surface subject of “seem” (COMLEX Syntax class TO-INF-RS (Macleod et al., 1998a)) to the subject of the infinitive. However, the subject of the infinitive in this case is not a logical subject due to the passive. In some cases, intermediate arcs are subject to number agreement, e.g., in “Which aliens did you say were seen?”, the I-SBJ of “were seen” agrees with “were”.

• Combined surface/logical roles (unprefixed arcs, which we refer to as SL- arcs). For example, “John” in “John ate cheese” would be the target of a SBJ subject arc.

^1 A few grammatical roles are nonfunctional, e.g., a constituent can have multiple ADV constituents. We number these roles (ADV1, ADV2, ...) to preserve functionality.

^2 That paper uses two arc types: category and relational.
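
As a rough illustration of structure sharing in such a DAG, the sketch below builds a small graph for the passive “John was kidnapped”, in which one NP node is the target of both an S-SBJ arc and an L-OBJ arc. The `Node` class, its `add` method and `roles_pointing_at` are hypothetical names chosen for this example, and the placement of the arcs (S-SBJ on S, L-OBJ on VP, with a PRD arc linking them) is an assumption rather than a description of the actual GLARF implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """A feature-structure node: a phrase category, or a POS label plus word."""
    label: str
    word: Optional[str] = None
    arcs: Dict[str, "Node"] = field(default_factory=dict)

    def add(self, role: str, target: "Node") -> "Node":
        # Functionality: at most one target per role label on a given node.
        if role in self.arcs:
            raise ValueError(f"role {role!r} already present on {self.label}")
        self.arcs[role] = target
        return target


def roles_pointing_at(root: Node, target: Node):
    """Collect the labels of all arcs under `root` whose target is `target`."""
    found, seen, stack = [], set(), [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:  # a shared node is visited only once
            continue
        seen.add(id(node))
        for role, child in node.arcs.items():
            if child is target:
                found.append(role)
            stack.append(child)
    return sorted(found)


# "John was kidnapped": one NP node shared by a surface and a logical arc.
john = Node("NP")
john.add("HEAD", Node("NNP", "John"))
vp = Node("VP")
vp.add("HEAD", Node("VBN", "kidnapped"))
vp.add("L-OBJ", john)
s = Node("S")
s.add("S-SBJ", john)
s.add("PRD", vp)

shared_roles = roles_pointing_at(s, john)
```

Because both arcs point at the same object rather than at two copies, any attribute later attached to the NP is visible from either role.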

Logical relations, encoded with SL- and L- arcs, are defined more broadly in GLARF than in most frameworks. Any regularization from a non-canonical linguistic structure to a canonical one results in logical relations. Following (Harris, 1968) and others, our model of canonical linguistic structure is the tensed active indicative sentence with no missing arguments. The following argument types will be at the head of logical (L-) arcs based on counterparts in canonical sentences which are at the head of SL- arcs: logical arguments of passives, understood subjects of infinitives, understood fillers of gaps, and interpreted arguments of nominalizations (in “Rome’s destruction of Carthage”, “Rome” is the logical subject and “Carthage” is the logical object). While canonical sentence structure provides one level of regularization, canonical verb argument structures provide another. In the case of argument alternations (Levin, 1993), the same role marks an alternating argument regardless of where it occurs in a sentence. Thus “the man” is the indirect object (IND-OBJ) and “a dollar” is the direct object (OBJ) in both “She gave the man a dollar” and “She gave a dollar to the man” (the dative alternation). Similarly, “the people” is the logical object (L-OBJ) of both “The people evacuated from the town” and “The troops evacuated the people from the town”, when we assume the appropriate regularization. Encoding this information allows applications to generalize. For example, a single Information Extraction pattern that recognizes the IND-OBJ/OBJ distinction would be able to handle these two examples. Without this distinction, two patterns would be needed.
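
The point about a single pattern can be made concrete with a small sketch. Here each clause is reduced, by hand, to a mapping from role labels to head words; this flat format and the function name `match_giving` are assumptions made for illustration, not the representation used by the mapping procedures.

```python
# Hand-built role assignments for the two dative-alternation examples.
# The keys are listed in surface order, but the pattern never looks at order.
double_object = {"SBJ": "She", "HEAD": "gave", "IND-OBJ": "man", "OBJ": "dollar"}
to_dative = {"SBJ": "She", "HEAD": "gave", "OBJ": "dollar", "IND-OBJ": "man"}


def match_giving(clause):
    """One role-based pattern: return (giver, recipient, gift) or None."""
    if clause.get("HEAD") != "gave":
        return None
    try:
        return (clause["SBJ"], clause["IND-OBJ"], clause["OBJ"])
    except KeyError:
        # A clause lacking one of the three roles does not match.
        return None


result_a = match_giving(double_object)
result_b = match_giving(to_dative)
```

A pattern written over surface positions would instead need one rule for the NP NP order and another for the NP PP order.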

Figure 1: Penn representation of gapping

Due to the diverse types of logical roles, we sub-type roles according to the type of regularization that they reflect. Depending on the application, one can apply different filters to a detailed GLARF representation, only looking at certain types of arcs. For example, one might choose all logical (L- and SL-) roles for an application that is trying to acquire selection restrictions, or all surface (S- and SL-) roles if one was interested in obtaining a surface parse. For other applications, one might want to choose between subtypes of logical arcs. Given a trilingual treebank, suppose that a Spanish treebank sentence corresponds to a Japanese nominalization phrase and an English nominalization phrase, e.g.,

Disney ha comprado Apple Computers
Disney’s acquisition of Apple Computers

Furthermore, suppose that the English treebank analyzes the nominalization phrase both as an NP (Disney = possessive, Apple Computers = object of preposition) and as a paraphrase of a sentence (Disney = subject, Apple Computers = object). For an MT system that aligns the Spanish and English graph representation, it may be useful to view the nominalization phrase in terms of the clausal arguments. However, in a Japanese/English system, we may only want to look at the structure of the English nominalization phrase as an NP.
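
Filtering a representation by arc type, as described above, might look like the following sketch. Arcs are written as (role, head word, dependent word) tuples; that flat format, the function names `arc_kind` and `view`, and the L-SBJ arc for “aliens” are assumptions introduced for the example, and attribute roles are left out for simplicity.

```python
def arc_kind(role):
    """Classify a role label by its prefix; unprefixed roles are combined (SL-)."""
    for prefix, kind in (("S-", "surface"), ("L-", "logical"), ("I-", "intermediate")):
        if role.startswith(prefix):
            return kind
    return "combined"


def view(arcs, level):
    """Keep arcs of the requested level ('surface' or 'logical') plus SL- arcs."""
    keep = {level, "combined"}
    return [arc for arc in arcs if arc_kind(arc[0]) in keep]


# Arcs for "John seemed to be kidnapped by aliens" and "John ate cheese".
arcs = [
    ("S-SBJ", "seemed", "John"),
    ("I-SBJ", "be", "John"),
    ("L-OBJ", "kidnapped", "John"),
    ("L-SBJ", "kidnapped", "aliens"),  # assumed analysis of the by-phrase
    ("SBJ", "ate", "John"),
    ("OBJ", "ate", "cheese"),
]

logical_view = view(arcs, "logical")
surface_view = view(arcs, "surface")
```

Note that prefix tests such as `startswith("S-")` do not misfire on unprefixed labels like SBJ or IND-OBJ, since those lack the hyphen in second position.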

4  GLARF  and  the  Penn  Treebank 

This section focuses on some characteristics of English GLARF and how we map PTB into GLARF, as exemplified by mapping the PTB representation in Figure 1 to the GLARF representation in Figure 2. In the process, we will discuss how some of the more interesting linguistic phenomena are represented in GLARF.

4.1  Mapping  into  GLARF 

Our procedure for mapping PTB into GLARF uses a sequence of transformations. The first transformation applies to PTB, and the output of each transformation_n is the input of transformation_{n+1}. As many of these transformations are trivial, we focus on the most interesting set of problems. In addition, we explain how GLARF is used to represent some of the more difficult phenomena.
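
The chaining of transformations can be sketched as a simple fold over a list of functions. The two toy steps below (`split_subject_tag`, `mark_verbal_head`) and the (role, label, body) tuple encoding are invented for illustration and are far simpler than the actual mapping procedures; only the idea that each step consumes the previous step's output comes from the text.

```python
from functools import reduce


def run_pipeline(tree, transformations):
    """Feed the output of transformation n into transformation n+1."""
    return reduce(lambda current, step: step(current), transformations, tree)


def map_tree(node, fn):
    """Apply fn to every (role, label, body) node bottom-up, building a new tree."""
    role, label, body = node
    if isinstance(body, list):
        body = [map_tree(child, fn) for child in body]
    return fn((role, label, body))


def split_subject_tag(tree):
    """Toy step: turn a PTB label such as NP-SBJ into label NP with role SBJ."""
    def fn(node):
        role, label, body = node
        if label.endswith("-SBJ"):
            return ("SBJ", label[: -len("-SBJ")], body)
        return node
    return map_tree(tree, fn)


def mark_verbal_head(tree):
    """Toy step: give the HEAD role to unlabeled verb children of a VP."""
    def fn(node):
        role, label, body = node
        if label == "VP" and isinstance(body, list):
            body = [("HEAD", l, b) if r is None and l.startswith("VB") else (r, l, b)
                    for r, l, b in body]
        return (role, label, body)
    return map_tree(tree, fn)


# (S (NP-SBJ (NNP John)) (VP (VBD slept))), with no roles assigned yet.
ptb = (None, "S", [(None, "NP-SBJ", [(None, "NNP", "John")]),
                   (None, "VP", [(None, "VBD", "slept")])])
result = run_pipeline(ptb, [split_subject_tag, mark_verbal_head])
```

Keeping each step a pure function from tree to tree makes it easy to reorder, add or test transformations in isolation.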

(Brants et al., 1997) describes an effort to minimize human effort in the annotation of raw text with comparable PRED-ARG information. In contrast, we are starting with an annotated corpus and want to add as much detail as possible automatically. We are as much concerned with finding good procedures for PTB-based parser output as we are with minimizing the effort of future human taggers. The procedures are designed to get the right answer most of the time. Human taggers will correct the results when they are wrong.

4.1.1  Conjunctions 

The treatment of coordinate conjunction in PTB is not uniform. Words labeled CC and phrases labeled CONJP usually function as coordinate conjunctions in PTB. However, a number of problems arise when one attempts to unambiguously identify the phrases which are conjoined. Most significantly, given a phrase XP with conjunctions and commas and some set of other constituents Y_1, ..., Y_n, it is not always clear which Y_i are conjuncts and which are not, i.e., Penn does not explicitly mark items as conjuncts and one cannot assume that all Y_i are conjuncts. In GLARF, conjoined phrases are clearly identified and conjuncts in those phrases are distinguished from non-conjuncts. We will discuss each problematic case that we observed in turn.

Instances of words that are marked CC in Penn do not always function as conjunctions. They may play the role of a sentential adverb, a preposition or the head of a parenthetical constituent. In GLARF, conjoined phrases are explicitly marked with the attribute value (CONJOINED T). The mapping procedures recognize that phrases beginning with CCs and PRN phrases containing CCs, among others, are not conjoined phrases.

A sister of a conjunction (other than a conjunction) need not be a conjunct. There are two cases. First of all, a sister of a conjunction can be a shared modifier, e.g., the right node raised PP modifier in “[NP senior vice president] and [NP general manager] [PP of this U.S. sales and marketing arm]”; and the locative “there” in “deterring U.S. high-technology firms from [investing or [marketing their best products] there]”. In addition, the boundaries of the conjoined phrase and/or the conjuncts that they contain are omitted in some environments, particularly when single words are conjoined and/or when the phrases occur before the head of a noun phrase or quantifier phrase. Some phrases which are under a single nonterminal node in the treebank (and are not further broken down) include the following: “between $190 million and $195 million”, “Hollingsworth & Vose Co.”, “cotton and acetate fibers”, “those workers and managers”, “this U.S. sales and marketing arm”, and “Messrs. Cray and Barnum”. To overcome this sort of problem, procedures introduce brackets and mark constituents as conjuncts. Considerations included POS categories, similarity measures, construction type (e.g., & is typically part of a name), among other factors.

Figure 2: GLARF representation of gapping

CONJPs have a different distribution than CCs. Different considerations are needed for identifying the conjuncts. First, CONJPs, unlike CCs, can occur initially, e.g., “[Not only] [was Fred a good doctor], [he was a good friend as well].” Secondly, they can be embedded in the first conjunct, e.g., “[Fred, not only, liked to play doctor], [he was good at it as well.]”.

In Figure 2, the conjuncts are labeled explicitly with their roles CONJ1 and CONJ2, the conjunction is labeled as CONJUNCTION1 and the top-most VP is explicitly marked as a conjoined phrase with the attribute/value (CONJOINED T).

4.1.2  Applying  Lexical  Resources

We merged together two lexical resources, NOMLEX (Macleod et al., 1998b) and COMLEX Syntax 3.1 (Macleod et al., 1998a), deriving PP complements of nouns from NOMLEX and using COMLEX for other types of lexical information. We use these resources to help add additional brackets, make additional role distinctions and fill a gap when its filler is not marked in PTB. Although Penn’s -CLR tags are good indicators of complement-hood, they only apply to verbal complements. Thus procedures for making adjunct/complement distinctions benefited from the dictionary classes. Similarly, COMLEX’s NP-FOR-NP class helped identify those -BNF constituents which were indirect objects (“John baked Mary a cake”, “John baked a cake [for Mary]”). The class PRE-ADJ identified those adverbial modifiers within NPs which really modify the adjective. Thus we could add the following brackets to the NP: “[even brief] exposures”. NTITLE and NUNIT were useful for the analysis of pattern type noun phrases, e.g., “President Bill Clinton”, “five million dollars”. Our procedures for identifying the logical subjects of infinitives make extensive use of the control/raising properties of COMLEX classes. For example, X is the subject of the infinitives in “X appeared to leave” and “X was likely to bring attention to the problem”.

4.1.3  NEs  and  Other  Patterns 

Over the past few years, there has been a lot of interest in automatically recognizing named entities, time phrases, quantities, among other special types of noun phrases. These phrases have a number of things in common including: (1) their internal structure can have idiosyncratic properties relative to other types of noun phrases, e.g., person names typically consist of optional titles plus one or more names (first, middle, last) plus an optional post-honorific; and (2) externally, they can occur wherever some more typical phrasal constituent (usually NP) occurs. Identifying these patterns makes it possible to describe these differences in structure, e.g., instead of identifying a head for “John Smith, Esq.”, we identify two names and a post-honorific. If this named entity went unrecognized, we would incorrectly assume that “Esq.” was the head. Currently, we merge the output of a named entity tagger with the Penn Treebank prior to processing. In addition to NE tagger output, we use procedures based on Penn’s proper noun wordtags.

In Figure 2, there are four patterns: two NUMBER and two TIME patterns. The TIME patterns are very simple, each consisting just of YEAR elements, although MONTH, DAY, HOUR, MINUTE, etc. elements are possible. The NUMBER patterns each consist of a single NUMBER (although multiple NUMBER constituents are possible, e.g., “one thousand”) and one UNIT constituent. The types of these patterns are indicated by the PATTERN attribute.

4.1.4  Gapping  Constructions

Figures 1 and 2 are corresponding PTB and GLARF representations of gapping. Penn represents gapping via “parallel” indices for corresponding arguments. In GLARF, the shared verb is at the head of two HEAD arcs. GLARF overcomes some problems with structure sharing analyses of gapping constructions. The verb gap is a “sloppy” (Ross, 1967) copy of the original verb. Two separate spending events are represented by one verb. Intuitively, structure sharing implies token identity, whereas type identity would be more appropriate. In addition, the copied verb need not agree with the subject in the second conjunct, e.g., “was”, not “were”, would agree with the second conjunct in “the risks were_i too high and the potential payoff e_i too far in the future”. It is thus problematic to view the gap as identical in every way to the filler in this case. In GLARF, we can thus distinguish the gapping sort of logical arc (L-GAPPING-HEAD) from the other types of L-HEAD arcs. We can stipulate that a gapping logical arc represents an appropriately inflected copy of the phrase at the head of that arc.

In GLARF, the predicate is always explicit. However, Penn’s representation (H. Koti, pc) provides an easy way to represent complex cases, e.g., “John wanted to buy gold, and Mary *gap* silver.” In GLARF, the gap would be filled by the nonconstituent “wanted to buy”. Unfortunately, we believe that this is a necessary burden. A goal of GLARF is to explicitly mark all PRED-ARG relations. Given parallel indices, the user must extract the predicate from the text by (imperfect) automatic means. The current solution for GLARF is to provide multiple gaps. The second conjunct of the example in question would have the following analysis: (S (SBJ Mary_j) (PRD (VP (HEAD GAP_i) (COMP (S NP_j (PRD (VP (HEAD GAP_k) (OBJ silver)))))))), where GAP_i is filled by “wanted”, GAP_k is filled by “to buy” and NP_j is bound to Mary.

5  Japanese  GLARF 

Japanese GLARF will have many of the same specifications described above. To illustrate how we will extend GLARF to Japanese, we discuss two difficult-to-represent phenomena: elision and stacked postpositions.

Grammatical analyses of Japanese are often dependency trees which use postpositions as arc labels. Arguments, when elided, are omitted from the analysis. In GLARF, however, we use role labels like SBJ, OBJ, IND-OBJ and COMP and mark elided constituents as zeroed arguments. In the case of stacked postpositions, we represent the different roles via different arcs. We also reanalyze certain postpositions as being complementizers (subordinators) or adverbs, thus excluding them from canonical roles. By reanalyzing this way, we arrived at two types of true stacked postpositions: nominalization and topicalization. For example, in Figure 3, the topicalized NP is at the head of two arcs, labeled S-TOP and L-COMP, and the associated postpositions are analyzed as morphological case attributes.

Figure 3: Stacked Postpositions in GLARF

6  Testing  the  Procedures 

To test our mapping procedures, we apply them to some PTB files and then correct the resulting representation using ANNOTATE (Brants and Plaehn, 2000), a program for annotating edge-labeled trees and DAGs, originally created for the NEGRA corpus. We chose both files that we have used extensively to tune the mapping procedures (training) and other files. We then convert the resulting GLARF Feature Structures into triples of the form {Role-Name Pivot Non-Pivot} for all logical arcs (cf. (Caroll et al., 1998)), using some automatic procedures. The “pivot” is the head of headed structures, but may be some other constituent in non-headed structures. For example, in a conjoined phrase, the pivot is the conjunction, and the head would be the list of heads of the conjuncts. Rather than listing the whole Pivot and Non-Pivot phrases in the triples, we simply list the heads of these phrases, which is usually a single word. Finally, we compute precision and recall by comparing the triples generated from our procedures to triples generated from the corrected GLARF.^3 An exact match is a correct answer and anything else is incorrect.^4

6.1  The  Test  and  the  Results 

We developed our mapping procedures in two stages. We implemented some mapping procedures based on PTB manuals, related papers and actual usage of labels in PTB. After our initial implementation, we tuned the procedures based on a training set of 64 sentences from two PTB files: wsj_0003 and wsj_0051, yielding 1285+ triples. Then we tested these procedures against a test set consisting of 65 sentences from wsj_0089 (1369 triples). Our results are provided in Figure 4. Precision and recall are calculated on a per sentence basis and then averaged. The precision for a sentence is the number of correct triples divided by the total number of triples generated. The recall is the total number of correct triples divided by the total number of triples in the answer key.
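
The scoring scheme just described, per-sentence precision and recall followed by averaging, can be sketched as follows. The function names and the toy triples are made up for this example, and representing each sentence's triples as a set assumes that no triple occurs twice within a sentence.

```python
def sentence_scores(system, gold):
    """Precision and recall for one sentence; only exact triple matches count."""
    system, gold = set(system), set(gold)
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def averaged_scores(sentences):
    """Average the per-sentence precision and recall over (system, gold) pairs."""
    scores = [sentence_scores(system, gold) for system, gold in sentences]
    n = len(scores)
    return (sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n)


# Toy data: (Role-Name, Pivot, Non-Pivot) triples, listing heads only.
gold_1 = {("SBJ", "ate", "John"), ("OBJ", "ate", "cheese")}
system_1 = {("SBJ", "ate", "John"), ("OBJ", "ate", "cheese")}

gold_2 = {("SBJ", "gave", "She"), ("IND-OBJ", "gave", "man"), ("OBJ", "gave", "dollar")}
system_2 = {("SBJ", "gave", "She"), ("OBJ", "gave", "dollar"),
            ("OBJ", "gave", "man"), ("ADV", "gave", "man")}

precision, recall = averaged_scores([(system_1, gold_1), (system_2, gold_2)])
```

In the toy data the second sentence has two correct triples out of four generated and three in the key, so its precision is 1/2 and its recall is 2/3 before averaging with the perfect first sentence.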

Out of 187 incorrect triples in the test corpus, 31 reflected the incorrect role being selected, e.g., the adjunct/complement distinction, 139 reflected errors or omissions in our procedures and 7 triples related to other factors. We expect a sizable improvement as we increase the size of our training corpus and expand the coverage of our procedures, particularly since one omission often resulted in several incorrect triples.

Figure 4: Results

^3 We admit a bias towards our output in a small number of cases (less than 1%). For example, it is unimportant whether “exposed to it” modifies “the group” or “workers” in “a group of workers exposed to it”. The output will get full credit for this example regardless of where the reduced relative is attached.

^4 (Caroll et al., 1998) report about 88% precision and recall for similar triples derived from parser output. However, they allow triples to match in some cases when the roles are different and they do not mark modifier relations.

7  Concluding  Remarks 

We show that it is possible to automatically map PTB input into PRED-ARG structure with high accuracy. While our initial results are promising, mapping procedures are limited by available resources. To produce the best possible GLARF resource, hand correction will be necessary.

We are improving our mapping procedures and extending them to PTB-based parser output. We are creating mapping procedures for the Susanne corpus, the Kyoto Corpus and the UAM Treebank. This work is a precursor to the creation of a trilingual GLARF treebank.

We are currently defining the problem of mapping treebanks into GLARF. Subsequently, we intend to create standardized mapping rules which can be applied by any number of algorithms. The end result may be that detailed parsing can be carried out in two stages. In the first stage, one derives a parse at the level of detail of the Penn Treebank II. In the second stage, one derives a more detailed parse. The advantage of such a division should be obvious: one is free to find the best procedures for each stage and combine them. These procedures could come from different sources and use totally different methods.